What data to use in introductory statistics and data science courses? Ideally data that’s
On the one hand, Cobb (2015) argues that we should
- “Teach through research”
- “Minimize prerequisites to research”
love @JennyBryan's analogy of classroom data as teddybears & real data like a grizzly bear with salmon blood dripping out its mouth #jsm2015
— sandy griffith (@sgrifter) August 11, 2015
In other words, a balancing act is required between
| Data with no prerequisites needed | Data as it exists “in the wild” |
|---|---|
Data “taming” sets out to
We propose the following “tame” data principles to remove the biggest hurdles R novices face.
- Clean variable names
- ID variables in left-hand columns
- Clean dates (More generally: clean numerical representations)
- Clean categorical variables
- Consistent “tidy” format
The fivethirtyeight R package:
Examples are in R, so I suggest you follow along in the HTML version of this talk, available at bit.ly/causeweb_tame
flying-etiquette.csv

library(readr)
library(fivethirtyeight)
# Raw data: variable names are unwieldy & have spaces
flying_raw <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/flying-etiquette-survey/flying-etiquette.csv")
colnames(flying_raw)[c(5, 19)]
## [1] "Do you have any children under 18?"
## [2] "In general, is itrude to bring a baby on a plane?"
# Tamed data: corresponding variable names are cleaner
colnames(flying)[c(5, 18)]
## [1] "children_under_18" "baby"
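As a rough sketch of what this kind of name cleaning can involve, the base R snippet here converts the two raw survey questions into syntactic names. The helper `clean_names_sketch()` and the toy `survey` data frame are hypothetical, written only for illustration; they are not the code used to build the package. Short names such as `baby` still require human judgement, which is why the second step renames by hand.

```r
# Hypothetical helper (not part of the fivethirtyeight package): turn free-text
# survey questions into lower-case, underscore-separated names.
clean_names_sketch <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)  # each run of non-alphanumerics becomes "_"
  gsub("^_+|_+$", "", x)           # trim leading/trailing underscores
}

raw_names <- c("Do you have any children under 18?",
               "In general, is itrude to bring a baby on a plane?")

# Automatic cleaning: syntactic, but still long
auto_names <- clean_names_sketch(raw_names)
auto_names

# Manual renaming on a toy data frame (the answers are made up)
survey <- data.frame(q1 = c("Yes", "No"), q2 = c("No", "Yes"))
names(survey) <- raw_names
names(survey) <- c("children_under_18", "baby")
names(survey)
```

The automatic step removes spaces and punctuation, so backticks are no longer needed in formulas; the manual step is what makes the names short enough for novices to type.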
Working with variable names that are long, unwieldy, and contain spaces is tricky.
mosaicplot(~ `Do you have any children under 18?` + `In general, is itrude to bring a baby on a plane?`,
data = flying_raw, main = "Raw data",
xlab = "Have a baby?", ylab = "Is it rude?")
mosaicplot(~ children_under_18 + baby,
data = flying, main = "Tamed data",
xlab = "Have a baby?", ylab = "Is it rude?")

This principle is more organizational: any identification variables that uniquely identify the observations/rows should be placed in the left-hand columns, since they are of highest prominence. Such variables are used as keys when joining/merging datasets.
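To illustrate why ID variables matter for joins, here is a minimal base R sketch. The `movies` and `ratings` data frames are toy data, and the column `hypothetical_rating` with its values is invented for illustration; only the idea of joining on a unique key comes from the text.

```r
# Toy data: `site` plays the role of a unique movie identifier.
movies <- data.frame(
  site = c("tt0108052", "tt0052738", "tt0068828"),
  title = c("Schindler's List", "The Diary of Anne Frank", "Lady Sings the Blues"),
  year_release = c(1993, 1959, 1972)
)

# A second, made-up table keyed on the same identifier.
ratings <- data.frame(
  site = c("tt0068828", "tt0108052"),
  hypothetical_rating = c(7.0, 9.0)
)

# An ID variable should have no duplicates.
stopifnot(anyDuplicated(movies$site) == 0)

# Left join: keep every movie, attach a rating where one exists.
joined <- merge(movies, ratings, by = "site", all.x = TRUE)
joined
```

Movies without a match in `ratings` keep their row and receive `NA`, and `merge()` places the key column first, which matches the left-hand-column convention.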
library(dplyr)
library(fivethirtyeight)
# Both title and imdb site tag uniquely identify movies
biopics %>%
sample_n(3)

| title | site | country | year_release | box_office | director | number_of_subjects | subject | type_of_subject | race_known | subject_race | person_of_color | subject_sex | lead_actor_actress |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Schindler’s List | tt0108052 | US | 1993 | 96100000 | Steven Spielberg | 1 | Oskar Schindler | Other | Known | White | FALSE | Male | Liam Neeson |
| The Diary of Anne Frank | tt0052738 | US | 1959 | NA | George Stevens | 1 | Anne Frank | Other | Known | White | FALSE | Female | Millie Perkins |
| Lady Sings the Blues | tt0068828 | US | 1972 | 9600000 | Sidney J. Furie | 1 | Billie Holiday | Musician | Known | African American | TRUE | Female | Diana Ross |
# episode uniquely identifies episodes of "The Joy of Painting"
bob_ross %>%
sample_n(3)

| episode | season | episode_num | title | apple_frame | aurora_borealis | barn | beach | boat | bridge | building | bushes | cabin | cactus | circle_frame | cirrus | cliff | clouds | conifer | cumulus | deciduous | diane_andre | dock | double_oval_frame | farm | fence | fire | florida_frame | flowers | fog | framed | grass | guest | half_circle_frame | half_oval_frame | hills | lake | lakes | lighthouse | mill | moon | mountain | mountains | night | ocean | oval_frame | palm_trees | path | person | portrait | rectangle_3d_frame | rectangular_frame | river | rocks | seashell_frame | snow | snowy_mountain | split_frame | steve_ross | structure | sun | tomb_frame | tree | trees | triple_frame | waterfall | waves | windmill | window_frame | winter | wood_framed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S04E03 | 4 | 3 | MAJESTIC MOUNTAINS | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| S15E06 | 15 | 6 | WAVES OF WONDER | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| S24E07 | 24 | 7 | BACK COUNTRY | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
US_births_1994-2003_CDC_NCHS.csv

library(readr)
library(fivethirtyeight)
# Raw data: year, month, day are separate variables
US_births_1994_2003_raw <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv")
head(US_births_1994_2003_raw)

| year | month | date_of_month | day_of_week | births |
|---|---|---|---|---|
| 1994 | 1 | 1 | 6 | 8096 |
| 1994 | 1 | 2 | 7 | 7772 |
| 1994 | 1 | 3 | 1 | 10142 |
| 1994 | 1 | 4 | 2 | 11248 |
| 1994 | 1 | 5 | 3 | 11053 |
| 1994 | 1 | 6 | 4 | 11406 |
# Tamed data: variable date of type "date" included
head(US_births_1994_2003)

| year | month | date_of_month | date | day_of_week | births |
|---|---|---|---|---|---|
| 1994 | 1 | 1 | 1994-01-01 | Sat | 8096 |
| 1994 | 1 | 2 | 1994-01-02 | Sun | 7772 |
| 1994 | 1 | 3 | 1994-01-03 | Mon | 10142 |
| 1994 | 1 | 4 | 1994-01-04 | Tues | 11248 |
| 1994 | 1 | 5 | 1994-01-05 | Wed | 11053 |
| 1994 | 1 | 6 | 1994-01-06 | Thurs | 11406 |
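One way a `date` column like this could be built from the separate year, month, and day columns is sketched here in base R. The small `births_raw` data frame just retypes the first three rows of the raw table; this is an illustrative assumption, not necessarily how the package authors did it.

```r
# First three rows of the raw births data, retyped by hand.
births_raw <- data.frame(
  year = c(1994, 1994, 1994),
  month = c(1, 1, 1),
  date_of_month = c(1, 2, 3),
  births = c(8096, 7772, 10142)
)

# ISOdate() builds a date-time from numeric parts; as.Date() drops the time.
births_raw$date <- as.Date(
  ISOdate(births_raw$year, births_raw$month, births_raw$date_of_month)
)

class(births_raw$date)
births_raw
```

Because `date` is a true `Date` object, plotting functions can place observations on a continuous time axis and handle month and year boundaries automatically.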
Without a variable of type date, making time series plots is difficult.
# Use filter command from dplyr package for data wrangling
US_births_1999 <- US_births_1994_2003 %>%
filter(year == 1999)
# Plot time series via base R:
plot(x = US_births_1999$date, y = US_births_1999$births, type = "l",
xlab = "Date", ylab = "Number of births", main = "1999 US Births")

movies.csv

library(readr)
library(ggplot2)
library(fivethirtyeight)
bechdel_raw <- read_csv("https://raw.githubusercontent.com/rudeboybert/fivethirtyeight/master/data-raw/bechdel/movies.csv")
# Raw data: categorical variable clean_test is saved as characters/strings
bechdel_raw$clean_test[1:5]
## [1] "notalk" "ok" "notalk" "notalk" "men"
# Tamed data: clean_test is saved as factor
bechdel$clean_test[1:5]
## [1] notalk ok notalk notalk men
## Levels: nowomen < notalk < men < dubious < ok
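An ordered factor with these levels could be created roughly as follows. This is a sketch on the five values printed above, using base R's `factor()`; the object names are chosen for illustration.

```r
# The five raw values printed above, stored as plain character strings.
clean_test_raw <- c("notalk", "ok", "notalk", "notalk", "men")

# Alphabetical order is what character data would give us by default.
sort(unique(clean_test_raw))

# Convert to an ordered factor, listing the levels from "worst" to "best".
clean_test <- factor(
  clean_test_raw,
  levels = c("nowomen", "notalk", "men", "dubious", "ok"),
  ordered = TRUE
)

levels(clean_test)
table(clean_test)  # counts are reported in level order, including empty levels
```

Since the levels are ordered, comparisons such as `clean_test[2] > clean_test[1]` are meaningful, and plots follow the level order rather than the alphabet.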
R by default plots characters in alphabetical order, whereas with factors we can set the order of the levels. In this case, we can have the bars ordered along the hierarchical nature of the Bechdel test:
# Using raw data:
ggplot(bechdel_raw, aes(x = clean_test)) +
geom_bar() +
labs(x = "Bechdel test outcome", y = "count", title = "Raw data")
# Using tamed data:
ggplot(bechdel, aes(x = clean_test)) +
geom_bar() +
labs(x = "Bechdel test outcome", y = "count", title = "Tamed data")

“Tidy” data format is narrow/long format, as opposed to wide. This format is chosen for input/output data frame standardization across many R packages in the tidyverse: ggplot2, dplyr, etc. There are three interrelated rules which make a dataset “tidy”:
- Each variable must have its own column
- Each observation must have its own row
- Each value must have its own cell
drinks.csv

library(dplyr)
library(ggplot2)
library(fivethirtyeight)
# In fivethirtyeight package drinks data is kept in original non-tidy (wide) format
head(drinks)

| country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol |
|---|---|---|---|---|
| Afghanistan | 0 | 0 | 0 | 0.0 |
| Albania | 89 | 132 | 54 | 4.9 |
| Algeria | 25 | 0 | 14 | 0.7 |
| Andorra | 245 | 138 | 312 | 12.4 |
| Angola | 217 | 57 | 45 | 5.9 |
| Antigua & Barbuda | 102 | 128 | 45 | 4.9 |
# tidyr::gather() code to convert to tidy format in help file: ?drinks
library(tidyr)
drinks_tidy <- drinks %>%
gather(type, servings, -c(country, total_litres_of_pure_alcohol)) %>%
arrange(country)
head(drinks_tidy)

| country | total_litres_of_pure_alcohol | type | servings |
|---|---|---|---|
| Afghanistan | 0.0 | beer_servings | 0 |
| Afghanistan | 0.0 | spirit_servings | 0 |
| Afghanistan | 0.0 | wine_servings | 0 |
| Albania | 4.9 | beer_servings | 89 |
| Albania | 4.9 | spirit_servings | 132 |
| Albania | 4.9 | wine_servings | 54 |
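In recent versions of tidyr, `gather()` is superseded by `pivot_longer()`. The sketch here shows the same wide-to-long conversion with `pivot_longer()`, assuming tidyr 1.0.0 or later is installed. The `drinks_small` data frame retypes the first three rows of the table so the example does not depend on the fivethirtyeight package.

```r
library(tidyr)

# First three rows of the wide drinks data, retyped by hand.
drinks_small <- data.frame(
  country = c("Afghanistan", "Albania", "Algeria"),
  beer_servings = c(0, 89, 25),
  spirit_servings = c(0, 132, 0),
  wine_servings = c(0, 54, 14),
  total_litres_of_pure_alcohol = c(0, 4.9, 0.7)
)

# Stack the three *_servings columns into a type/servings pair of columns.
drinks_small_tidy <- pivot_longer(
  drinks_small,
  cols = c(beer_servings, spirit_servings, wine_servings),
  names_to = "type",
  values_to = "servings"
)

drinks_small_tidy
```

Each country now contributes three rows, one per alcohol type, so `type` can be mapped directly to an aesthetic in ggplot2.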
ggplot(drinks_tidy, aes(x = type, y = servings)) +
geom_boxplot() +
labs(x = "Alcohol type", y = "Number of servings", title = "Worldwide alcohol consumption")

clinton.csv
trump.csv

In the tamed pres_2016_trail data frame we:
- Ensured lat and lng were in numerical format, not in degree/minute/second, North/South, and East/West format (a variation on Principle 3: Dates)
- Combined the two files into one data frame, with the variable candidate identifying each row (Principle 5: Tidy data format)

library(dplyr)
library(fivethirtyeight)
# Tamed data:
pres_2016_trail %>%
arrange(date) %>%
head()

| candidate | date | location | lat | lng |
|---|---|---|---|---|
| Trump | 2016-09-01 | Wilmington, OH | 39.44534 | -83.82854 |
| Trump | 2016-09-03 | Detroit, MI | 42.33143 | -83.04575 |
| Clinton | 2016-09-05 | Cleveland, Ohio | 41.49932 | -81.69436 |
| Clinton | 2016-09-05 | Hampton, Illinois | 41.55587 | -90.40930 |
| Clinton | 2016-09-06 | Tampa, Florida | 27.95058 | -82.45718 |
| Trump | 2016-09-06 | Virginia Beach, VA | 36.85293 | -75.97799 |
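A single data frame with a `candidate` column can be produced from per-candidate tables by stacking them, as in this base R sketch. The two small data frames are hypothetical stand-ins holding one stop each from the table above; the actual layout of the raw files is not assumed.

```r
# Hypothetical per-candidate tables, one stop each, values from the table above.
clinton_stops <- data.frame(
  date = as.Date("2016-09-05"), location = "Cleveland, Ohio",
  lat = 41.49932, lng = -81.69436
)
trump_stops <- data.frame(
  date = as.Date("2016-09-01"), location = "Wilmington, OH",
  lat = 39.44534, lng = -83.82854
)

# Add the identifying variable as the left-most column, then stack the rows.
trail <- rbind(
  data.frame(candidate = "Clinton", clinton_stops),
  data.frame(candidate = "Trump", trump_stops)
)

# Sort chronologically, as in the output above.
trail <- trail[order(trail$date), ]
trail
```

With one row per campaign stop and `candidate` as a regular variable, the same data frame can be filtered, grouped, or faceted by candidate without further reshaping.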
So we can easily create a faceted map!
library(ggplot2)
library(maps)
ggplot(data = pres_2016_trail, aes(x = lng, y = lat)) +
facet_wrap(~candidate) +
geom_point(col = "black", size = 2) +
coord_map() +
# Override data & aes()thetic mapping set above to trace path of state outlines:
geom_path(data = map_data("state"), aes(x = long, y = lat, group = group), size = 0.1)
- Recruited STAT231 Data Science students to “tame” datasets that STAT135 Intro students found for their final projects
- Available on GitHub: data wrangling source code by package authors to convert 538 raw CSV data to “tamed” format: process_data_sets_albert.R, process_data_sets_chester.R, process_data_sets_jen.R
- fivethirtyeight package is in maintenance mode: no new development, only need to add new datasets
- Internship model of learning/development: learning R package construction, GitHub, communication and project management skills, etc.
- RStudio’s 2018 broom package summer internship follows a similar model.
- Undergraduate student-written data wrangling source code to convert 538 raw CSV data to “tamed” format: process_data_sets_maggie.R, process_data_sets_meredith.R
Comments
- fivethirtyeight is like a data petting zoo
- fivethirtyeight package used in other contexts: